Large vocabulary continuous speech recognition (LVCSR) is in growing demand for transcribing daily conversations, yet developing spoken text data to train LVCSR systems is costly and time-consuming. In this paper, we propose a classification-based method to automatically select social media data for constructing a spoken-style language model for LVCSR. Three classification techniques, SVM, CRF, and LSTM, trained on words and parts of speech, are compared experimentally for identifying the degree of spoken style in each social media sentence. Spoken-style utterances are chosen by incremental greedy selection based on the score of the SVM or CRF classifier, or on the output labeled as "spoken" by the LSTM classifier. With the proposed method, just 51.8%, 91.6%, and 79.9% of the utterances in a Twitter text collection are marked as spoken utterances by the SVM, CRF, and LSTM classifiers, respectively. A baseline language model is then improved by interpolating it with a model trained on these selected utterances. The proposed model is evaluated on two Thai LVCSR tasks: social media conversations and a speech-to-speech translation application. Experimental results show that all three classification-based data selection methods clearly help reduce the perplexity on the spoken test sets. In terms of LVCSR word error rate (WER), they achieve 3.38%, 3.44%, and 3.39% WER reduction, respectively, over the baseline language model, and 1.07%, 0.23%, and 0.38% WER reduction, respectively, over the conventional perplexity-based text selection approach.
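The score-based greedy selection and language model interpolation described above can be sketched roughly as follows. Note that `spoken_score` here is a toy stand-in and the threshold is an assumption for illustration; the paper uses trained SVM/CRF/LSTM classifiers over words and parts of speech, not this heuristic.

```python
def spoken_score(sentence):
    # Toy stand-in scorer in [0, 1]: fraction of informal marker words.
    # A real system would use an SVM/CRF/LSTM trained on words and POS tags.
    informal = {"lol", "gonna", "wanna", "hey", "yeah"}
    words = sentence.lower().split()
    return sum(w in informal for w in words) / max(len(words), 1)

def greedy_select(sentences, threshold=0.2):
    # Rank candidates by classifier score, then add them incrementally
    # while each one's score stays above the acceptance threshold.
    ranked = sorted(sentences, key=spoken_score, reverse=True)
    selected = []
    for s in ranked:
        if spoken_score(s) < threshold:
            break
        selected.append(s)
    return selected

def interpolate(p_base, p_selected, lam=0.5):
    # Linear interpolation of a baseline LM probability with the one
    # estimated on the selected spoken-style utterances.
    return lam * p_base + (1 - lam) * p_selected
```

In practice the interpolation weight `lam` would be tuned to minimize perplexity on a held-out spoken-style development set.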